Environments & Solvers
Here we create a function to train and visualize a policy given an environment and a solver. The function takes several arguments, such as the number of episodes, the maximum steps per episode, and the output file paths. Since verbose is set to True, for each environment-solver pair the training metrics are printed, the convergence plot is displayed, the optimal policy is rendered, and a GIF showing the optimal policy in action is displayed.
# set parameters (same for all solvers and envs)
verbose = True
save_metrics = True

# function to train and visualize policy for a given environment and solver
def train_and_visualize_policy(env, solver, episodes, max_steps, verbose, save_metrics,
                               gif_filename, image_filename, convergence_plot_filename):
    # train agent
    policy = solver.train(max_steps=max_steps, episodes=episodes, verbose=verbose)

    # print policy
    print("\nOptimal Policy:")
    print(policy)

    # render optimal policy
    Utils.render_optimal_policy(
        env, policy, save_image=save_metrics, image_filename=image_filename
    )

    # run optimal policy
    Utils.run_optimal_policy(
        env, policy, save_gif=save_metrics, gif_filename=gif_filename
    )

    # plot convergence plot
    Utils.plot_convergence(solver.mean_reward, file_path=convergence_plot_filename)
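For reference, a call like the one below would produce the Monte Carlo run reported next. This is a minimal sketch: the environment and solver class names (BoatEnv, MonteCarloES) and the max_steps value are assumptions, since the construction code is not shown here; only the episode count and output file paths are taken from the logs that follow.

# example invocation (a sketch: BoatEnv, MonteCarloES, and max_steps=200 are
# assumed names/values; the file paths match the outputs reported below)
env = BoatEnv()
solver = MonteCarloES(env)

train_and_visualize_policy(
    env,
    solver,
    episodes=100,
    max_steps=200,
    verbose=verbose,
    save_metrics=save_metrics,
    gif_filename="./outputs/gameplay_boat_mc.gif",
    image_filename="./outputs/optim_policy_boat_mc.png",
    convergence_plot_filename="./outputs/convergence_plot_boat_mc.png",
)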
Starting Monte Carlo ES training for 100 episodes...
Episode 10/100 - Average Return: 108.90, Average Q-Value Update: 12.5477
Episode 20/100 - Average Return: 156.90, Average Q-Value Update: 2.8513
Episode 30/100 - Average Return: 154.60, Average Q-Value Update: 1.4027
Episode 40/100 - Average Return: 154.90, Average Q-Value Update: 1.0536
Episode 50/100 - Average Return: 155.70, Average Q-Value Update: 0.4959
Episode 60/100 - Average Return: 156.50, Average Q-Value Update: 0.4762
Episode 70/100 - Average Return: 152.60, Average Q-Value Update: 1.0448
Episode 80/100 - Average Return: 153.80, Average Q-Value Update: 0.3804
Episode 90/100 - Average Return: 158.10, Average Q-Value Update: 0.3425
Episode 100/100 - Average Return: 155.00, Average Q-Value Update: 0.2713
Action distribution across episodes: {0: '0.081', 1: '0.919'}
Final Average Return: 150.70
Final Average Q-Value Update: 2.0866
Final Action Values (Q):
[[ 73.375 134.41791045]
[ 98.68817204 152.13265306]]
Optimal Policy:
[[0. 1.]
[0. 1.]]
Optimal policy visualization saved as ./outputs/optim_policy_boat_mc.png
Gameplay GIF saved as ./outputs/gameplay_boat_mc.gif
Average reward following optimal policy: 160.60
Convergence plot saved as ./outputs/convergence_plot_boat_mc.png
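For context, the core of a Monte Carlo ES (Exploring Starts) solver is an incremental average of episode returns for each state-action pair. The sketch below shows a generic first-visit version of that update; it is illustrative only and not necessarily the exact implementation of the solver used here.

import numpy as np

# generic first-visit Monte Carlo ES update for one sampled episode
# (illustrative sketch; not necessarily the solver implementation used above)
def mc_es_episode_update(Q, returns_count, episode, gamma=0.99):
    # episode: list of (state, action, reward) tuples, generated with an
    # exploring start (random initial state-action pair), greedy thereafter
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r
        # first-visit check: update only the earliest occurrence of (s, a)
        if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:
            returns_count[(s, a)] = returns_count.get((s, a), 0) + 1
            # incremental average of the returns observed for (s, a)
            Q[s, a] += (G - Q[s, a]) / returns_count[(s, a)]
    return Q

# e.g. for a 2-state, 2-action environment like the boat world
Q = np.zeros((2, 2))
returns_count = {}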
Training Temporal Difference algorithm for 250 episodes...
Episode 1/250 - Average Return: 8.00, Average Q-Value Update: 0.0329
Episode 26/250 - Average Return: 69.04, Average Q-Value Update: 0.1394
Episode 51/250 - Average Return: 78.08, Average Q-Value Update: 0.0502
Episode 76/250 - Average Return: 76.32, Average Q-Value Update: 0.0598
Episode 101/250 - Average Return: 77.28, Average Q-Value Update: 0.0504
Episode 126/250 - Average Return: 77.92, Average Q-Value Update: 0.0564
Episode 151/250 - Average Return: 74.88, Average Q-Value Update: 0.0268
Episode 176/250 - Average Return: 77.32, Average Q-Value Update: 0.0537
Episode 201/250 - Average Return: 75.48, Average Q-Value Update: 0.0742
Episode 226/250 - Average Return: 76.60, Average Q-Value Update: 0.0855
Training complete! Action distribution across episodes: [0.06656 0.93344]
Final Action Values (Q):
[[16.77819255 29.04173809]
[27.88650034 31.35323269]]
Optimal Policy:
[[0. 1.]
[0. 1.]]
Optimal policy visualization saved as ./outputs/optim_policy_boat_td.png
Gameplay GIF saved as ./outputs/gameplay_boat_td.gif
Average reward following optimal policy: 162.80
Convergence plot saved as ./outputs/convergence_plot_boat_td.png
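For comparison, the temporal-difference solver relies on a one-step bootstrapped update rather than full-episode returns. The sketch below shows a Q-learning-style update; the specific variant and hyperparameters are assumptions, since the solver's internals are not shown here, and the actual implementation may use SARSA or different settings.

import numpy as np

# one tabular TD update (Q-learning-style); an illustrative sketch, not
# necessarily the exact variant or hyperparameters used by the solver above
def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target bootstraps on the greedy value of the next state
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    # the magnitude of td_error is the kind of quantity an
    # "Average Q-Value Update" metric would aggregate per episode
    return abs(td_error)

# e.g. one transition in a 2-state, 2-action environment
Q = np.zeros((2, 2))
td_update(Q, s=0, a=1, r=1.0, s_next=1)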